Introduction to Open Data Science - Course Project

| Chapter | Topic | Max points | Book Ch (K.V.) | Book Ch (R for HDS) | Assignment done? |
|---|---|---|---|---|---|
| 1 | Start me up! | 20 | 1 & 2 | - | |
| 2 | Regression and model validation | 5+15 | 3 & 4 | 7 | |
| 3 | Logistic regression | 5+15 | 5 & 6 | 7 | |
| 4 | Clustering and classification | 15+5 | - | 12, 17 & 18 | |
| 5 | Dimensionality reduction techniques | | | | |
| 6 | Analysis of longitudinal data | | | | |
| Misc | Misc | | | | |

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Sun Nov 27 23:30:53 2022"

The text continues here.

Hello, testing this!! ZOM!

Hello World, why are my commits not pushing through?

I was feeling happy and motivated! I got R, Rstudio, Git and Github up and running. Now I see my commits on Github but unfortunately my diary is not updating. Apparently, according to Github, the updating of a public page may take up to 24h. I feel like that is a very long time and I wonder if commits made directly on Github (not remotely with Git) would take as long.

I discovered the IODS course in an email sent to all doctoral schools. I’m looking forward to learning about R Markdown, project management on GitHub, and clustering.

Looking forward to this course!

Update: I got to attend our course IT-help-zoom. It was great! Even though my diary did not update despite our efforts. Apparently everything looks good and my diary should update if I wait some more. The info was reassuring so I’ll be waiting more serenely now!

Meanwhile, I’ll keep myself busy with R for Health Data Science, as suggested during the course introduction lecture. It seems very informative and easy to read. I’m somewhat familiar with R but I haven’t used the tidyverse much. Piping still confuses me a bit, but it’s my favorite topic! Exercise Set 1 is interesting and useful; it did require some package installations, but I’m good to go and ready to keep working now that all packages are…packed along!

Update v2: We did it! The diary updated itself after 2h 13min of me patiently waiting (and eagerly pushing new commits non-stop).

Link to my Github repo https://github.com/hsanez/IODS-project

Yay! [Image: A happy girl at a laptop, by Marina Vishtak]

Ch2 Regression and model validation

Describe the work you have done this week and summarize your learning.

  • Describe your work and results clearly.
  • Assume the reader has an introductory course level understanding of writing and reading R code as well as statistical methods.
  • Assume the reader has no previous knowledge of your data or the more advanced methods you are using.
date()
## [1] "Sun Nov 27 23:30:53 2022"

Here we go again…

Libraries used for Ch2

Load the needed R packages before getting to work: Obs! You might need to install the packages first if you haven’t used them before (see install.packages()).

#install.packages(c("tidyverse","GGally","readr"))
library(tidyverse)
library(GGally)

Data wrangling

1. Read data (1p.)

I created a new data folder in my IODS-project folder. It includes an R script named ‘create_learning2014.R’, which is the R code for this data wrangling part of Assignment 2. The original data I’m using is the full learning2014 data downloaded from here. The data includes headers and uses tab as a separator. As per the course assignment instructions I have commented the data and my code in ‘create_learning2014.R’.

✓ Done!

2. Create an analysis dataset (1p.)

In this part we created an analysis dataset containing 7 variables: gender, age, attitude, deep, stra, surf and points. Some of these (attitude, deep, stra, surf, points) were made by combining questions in the learning2014 data, as defined in our Exercise Set and on the bottom part of this page. The combination variables were scaled and observations where points = 0 were excluded. The data now has 166 observations and 7 variables. See my full comments and code in ‘create_learning2014.R’.
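The combination step can be sketched roughly like this (a minimal toy example; the item names and values below are illustrative, not the actual survey items — see ‘create_learning2014.R’ for the real code):

```r
library(dplyr)

# Toy stand-in for the full learning2014 data; D* are hypothetical item names.
lrn14 <- tibble::tibble(
  gender = c("F", "M", "F"),
  age    = c(25, 30, 22),
  D03 = c(4, 3, 5), D06 = c(5, 2, 4), D07 = c(3, 4, 4),
  points = c(22, 0, 19)
)

# Averaging the question group keeps the combination variable on the 1-5 scale
lrn14$deep <- rowMeans(lrn14[, c("D03", "D06", "D07")])

# Keep only the analysis variables and drop students with zero exam points
analysis <- lrn14 %>%
  select(gender, age, deep, points) %>%
  filter(points > 0)
```

The same pattern (rowMeans over a question group, then select and filter) repeats for attitude, stra and surf in the real script.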

✓ Done!

3. Save new data set (3p.)

I set the working directory of my R session to the IODS-project folder with setwd() and then saved the analysis dataset to the data folder as learning2014.csv with write_csv() (readr package, part of tidyverse). The help page ?write_csv was useful. With read_csv(), str(), dim() and head() I made sure the data was written correctly. Again, see my full comments and code in ‘create_learning2014.R’.
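A minimal sketch of that save-and-verify round trip (toy data and a temporary file here, instead of the actual data folder path):

```r
library(readr)

# Toy data standing in for the analysis dataset
df <- data.frame(gender = c("F", "M"), points = c(22, 19))

# Write to CSV, read it back, and confirm nothing was lost on the way
path <- tempfile(fileext = ".csv")
write_csv(df, path)
check <- read_csv(path, show_col_types = FALSE)
stopifnot(identical(dim(check), dim(df)))
```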

✓ Done!

Analysis

  • The Analysis exercise focuses on performing and interpreting regression analysis.
  • Include all the code, your interpretations and explanations in the R Markdown file chapter2.Rmd.
  • The output of your Analysis should appear in your course diary (update your local index.html file by knitting the index.Rmd file (which includes chapter2.Rmd as a child file) and then push the changes to GitHub).
  • Write a continuous report with a clear structure.
  • For full points you should show an understanding of the methods and results used in your analysis. Clear, understandable and comprehensive explanations are worth full points.
  • Feel free to also use material outside this course as learning sources.

R Markdown Hint: When you knit the document, the working directory is temporarily set to be the folder where your R Markdown file is located. This is good to be aware of when reading data for example.

REPORT (Assignment 2)

1. Read and describe data (3p.)

First we start by reading the analysis data into R. Make sure all required packages are installed (see the beginning of Ch2). The data we’re using is the one created in the data wrangling part of Assignment 2. The data can also be read directly from this page, where the separator is a comma and headers are included.

#load required packages
library(tidyverse)
library(GGally)
# Read data (Make sure you're in the correct working folder with getwd())
lrn14a <- read_csv("data/learning2014.csv")
# Alternative site for reading the data
#lrn14a <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt")
#Structure of the data
str(lrn14a)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   attitude = col_double(),
##   ..   deep = col_double(),
##   ..   stra = col_double(),
##   ..   surf = col_double(),
##   ..   points = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
# Dimensions and first peek at the data
dim(lrn14a);head(lrn14a)
## [1] 166   7
## # A tibble: 6 × 7
##   gender   age attitude  deep  stra  surf points
##   <chr>  <dbl>    <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 F         53      3.7  3.58  3.38  2.58     25
## 2 M         55      3.1  2.92  2.75  3.17     12
## 3 F         49      2.5  3.5   3.62  2.25     24
## 4 M         53      3.5  3.5   3.12  2.25     10
## 5 M         49      3.7  3.67  3.62  2.83     22
## 6 F         38      3.8  4.75  3.62  2.42     21

The analysis data is a subset of the full dataset called ‘learning2014’. The full data comes from a survey on attitudes towards studying and statistics, conducted between 3.12.2014 and 1.10.2015 among students of an introductory course in social statistics.

The results are students’ answers to questions on different subtopics of learning, and some of these subtopics are summarized as new variables in our analysis dataset. The subset we created and read, ‘learning2014.csv’, contains 7 variables: gender, age, attitude, deep, stra, surf and points, where points records the student’s exam points.

Some of these variables were made by combining questions in the learning2014 data

  • attitude = summary variable of questions about the student’s attitude towards statistics
  • deep = deep approach = summary variable of questions to assess whether the student is using a deep approach to learning
  • stra = strategic approach = summary variable of questions to assess whether the student is using a strategic approach to learning
  • surf = surface approach = summary variable of questions to assess whether the student is using a surface approach to learning

See the survey questions and combinations made for the above mentioned variables on this page.

The summary variables were scaled back to the original 1-5 Likert scale by taking the mean of each question group, and observations (students) with 0 exam points were excluded.

The remaining subset of data has 166 observations (students) and 7 variables.

2. Graphical overview and summary (3p.)

We will look at a graphical overview of our data. We have 7 variables, of which gender seems to be a dichotomous variable without missing values, with more female than male students:

summary(lrn14a$gender); unique(lrn14a$gender); table(lrn14a$gender)
##    Length     Class      Mode 
##       166 character character
## [1] "F" "M"
## 
##   F   M 
## 110  56

We’ll use this information to split our graphical output in two (male and female students) and draw summary plots of the remaining variables:

# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)

# create a more advanced plot matrix with ggpairs()
p <- ggpairs(lrn14a, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p

In this plot matrix we can see all variables split by gender: boxplots and histograms for gender against each variable, density and scatter plots for the continuous variables, and Pearson correlation coefficients with statistical significance marked by asterisks (see the GGally documentation).

As we can see:

  • There are more female (red) than male (blue) students.
  • Most students are under 30 years old; female students tend to be somewhat younger overall, with a lower median age.
  • Male students tend to have a more positive attitude towards studying and statistics.
  • Both genders score high on the deep approach to studying, but female students show slightly more strategic and surface-level approaches than males.

For the Pearson correlation results we can see both overall and gender-specific correlation coefficients.

  • Attitude towards statistics and exam points are positively correlated at a statistically significant level (p<0.001): a higher attitude summary score is associated with higher exam points regardless of gender.
  • Logically, the surface and deep approaches to studying are negatively correlated (p<0.001) among male students, but interestingly not among females.
  • There are also signs of a negative correlation between attitude and the surface approach to studying among males.

3. Regression model fitting (4p.)

We’ll proceed to create a regression model. As instructed, we’ll choose three explanatory variables for the model, with exam points as the dependent (outcome) variable. I’ll base my choice on the correlation information above, and I’ll try to avoid possible collinearity between the chosen variables. Since there are multiple explanatory (independent) variables, we’ll use a multiple regression approach.

Dependent variable:

  • points

Independent variables:

  • attitude (statistically significant correlation with points)
  • stra (next-largest correlation with points)
  • surf (relatively large negative correlation with points)
# create a regression model with multiple explanatory variables
reg_model <- lm(points ~ attitude + stra + surf, data = lrn14a)

# print out a summary of the model
summary(reg_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14a)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

From the summary of the model we can see that

  • attitude is positively and statistically significantly (p=1.9E-8) associated with exam points. From the estimate we can infer that an increase of 1 point in the mean attitude score (a summary variable of the attitude-related questions) corresponds to an increase of approximately 3.4 exam points when the other explanatory variables remain unchanged.
  • The strategic and surface approaches to studying do not seem to be associated with exam points.

The t values are t statistics that have been calculated as estimate/standard error.

The F-statistic has a very low p-value, meaning there is evidence that not all of the coefficients of the explanatory variables are zero, i.e. at least one of them differs from zero.

R-squared tells us about the fit of the model to our data. Since we used multiple explanatory variables, we’ll look at the adjusted R-squared, which penalizes the plain R-squared for the number of explanatory variables in the model. Here we see an adjusted R-squared of 0.19, meaning that the three chosen explanatory variables account for approximately 19% of the variation in exam points.
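The figures in the summary can be checked by hand from the printed estimates (plain arithmetic on the numbers shown above):

```r
# t value = Estimate / Std. Error, e.g. for attitude
t_attitude <- 3.3952 / 0.5741
round(t_attitude, 2)  # ~5.91, matching the printout (which rounds earlier)

# Adjusted R-squared penalizes plain R-squared for the number of predictors p:
# adj R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
r2 <- 0.2074; n <- 166; p <- 3
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)  # 0.1927, matching the printout
```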

Since only attitude seems to be statistically significantly associated with the dependent variable, we’ll rerun the regression model. We’ll keep attitude as an explanatory variable and instead of surf and stra, we’ll test age and gender as explanatory variables.

# create a regression model with attitude, age and gender as explanatory variables
reg_model_extra <- lm(points ~ attitude + age + gender, data = lrn14a)

# print out a summary of the model
summary(reg_model_extra)

4. Results - model simplification (3p.)

Since only attitude seems to be statistically significantly associated with the dependent variable, points, we’ll run the regression model once more, this time only with attitude as an explanatory variable, to fit the model better and get a more parsimonious model. We’ll also draw a scatter plot with a regression line to visualize the result.

# a scatter plot of points versus attitude
library(ggplot2)
qplot(attitude, points, data = lrn14a) + geom_smooth(method = "lm")

# create a regression model with attitude as the only explanatory variable
reg_model2 <- lm(points ~ attitude, data = lrn14a)

# print out a summary of the model
summary(reg_model2)
## 
## Call:
## lm(formula = points ~ attitude, data = lrn14a)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The regression line fits the scatter plot quite nicely, with a positive slope and tight 95% confidence intervals. Again, we see that attitude is statistically significantly associated with points (p=4.1E-9): exam points rise by approximately 3.5 when the attitude score (based on the original survey questions) rises by 1. The estimate is slightly higher and the p-value slightly lower than before. Since we now have only one explanatory variable, we can assess the multiple R-squared of the model (0.19), which should equal the square of the correlation coefficient between attitude and points. This means that almost 20% of the variation in points can be explained by attitude.
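The claim that multiple R-squared equals the squared correlation holds for any simple linear regression; a quick check on simulated data (not the course data):

```r
# For a single predictor, R-squared from lm() equals cor(x, y)^2
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
r2_model <- summary(lm(y ~ x))$r.squared
stopifnot(isTRUE(all.equal(r2_model, cor(x, y)^2)))
```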

5. Diagnostic plots (3p.)

We have fitted a regression model to our data and interpreted the results. Now we will assess the model to check whether it complies to our statistic assumptions, and that it is a sensible model to use on our data.

Graphical model validation with plot():

  • Residuals vs Fitted values (1)
  • Normal QQ-plot (2)
  • Residuals vs Leverage (5)
# par() could be used to show plots in a matrix
# par(mfrow = c(2,2))
# draw diagnostic plots using the plot() function.
plot(reg_model2, which=c(1,2,5))

Linear regression modeling has four main assumptions (quoted from R for Health Data Science):

  1. Linear relationship between predictors and outcome
  2. Independence of residuals
  3. Normal distribution of residuals
  4. Equal variance of residuals
  • In the ‘Residuals vs Fitted’ plot the variance of the residuals seems to stay roughly equal across fitted values (assumption 4 ✓).
  • In the Normal QQ plot the residuals look quite normally distributed; some points curve away from the reference line at both ends, meaning there are some extreme values, but the distribution seems normal (assumption 3 ✓).
  • From the ‘Residuals vs Leverage’ plot we can estimate whether some observations affect the model considerably more than others (outliers). As in the QQ plot, we see some extreme values, but none of them cross the Cook’s distance threshold, so we can conclude that no critical outliers affect our model.

List of diagnostic plots for plot() and which():

| which | graphic |
|---|---|
| 1 | Residuals vs Fitted values |
| 2 | Normal QQ-plot |
| 3 | Standardized residuals vs Fitted values |
| 4 | Cook’s distances |
| 5 | Residuals vs Leverage |
| 6 | Cook’s distance vs Leverage |

✓ ✓ ✓ ✓ ✓ Done!

Ch3 Logistic regression and cross-validation

Describe the work you have done this week and summarize your learning.

  • Describe your work and results clearly.
  • Assume the reader has an introductory course level understanding of writing and reading R code as well as statistical methods.
  • Assume the reader has no previous knowledge of your data or the more advanced methods you are using.
date()
## [1] "Sun Nov 27 23:31:07 2022"

Here we go yet again…

Libraries used for Ch3

Obs! You might need to install some packages first if you haven’t used them before (see install.packages()).

#install.packages(c("tidyverse","GGally","readr"))

Load the needed R packages before getting to work:

#load required packages
library(tidyverse)
library(GGally)
library(dplyr)
library(ggplot2)

Data wrangling

See create_alc.R on my Github repository.

Data analysis

REPORT (Assignment 3)

The purpose of this analysis is to study the relationships between high/low alcohol consumption and some of the other variables in our data.

2. Read and describe the data

Read the analysis data.

# Read data (Make sure you're in the correct working folder with getwd())
alc <- read_csv("data/student_alc.csv")
# Alternative site for reading the data
#alc <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv")

Describe the data.

# print variable names and dimension of the tibble
colnames(alc); dim(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
## [1] 370  35

The analysis dataset we are using for this course assignment consists of the above listed 35 variables and 370 observations.

Our analysis dataset is a subset of the original ‘Student Performance Data Set’ by Prof. Cortez (University of Minho, Portugal), collected from two Portuguese schools during the school year 2005-2006 using questionnaires and school reports.

The analysis dataset was created by joining two datasets of the original data: student data from two school subjects, Mathematics and Portuguese language. These subject-specific datasets were merged by their common variables, and only students present in both subject datasets were included in the resulting analysis dataset. Naturally, the following course-specific variables differed between the two datasets:

  • failures = number of past class failures (numeric: n if 1<=n<3, else 4)
  • paid = extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
  • absences = number of school absences (numeric: from 0 to 93)
  • G1 = first period grade (numeric: from 0 to 20)
  • G2 = second period grade (numeric: from 0 to 20)
  • G3 = final grade (numeric: from 0 to 20, output target)

For the numeric variables we calculated student-specific averages using both of the datasets (Mathematics and Portuguese language). For the binary variable (paid) we chose the value from the Mathematics dataset.
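A toy sketch of that merge-and-average logic (invented values and an illustrative `id` join key; the real join keys and code are in create_alc.R):

```r
library(dplyr)

# Toy subject-specific datasets
math <- tibble::tibble(id = 1:2, absences = c(4, 10), paid = c("yes", "no"))
por  <- tibble::tibble(id = 1:3, absences = c(2, 6, 8), paid = c("no", "no", "yes"))

# Keep only students present in both datasets
joined <- inner_join(math, por, by = "id", suffix = c(".math", ".por"))

alc_toy <- joined %>%
  mutate(absences = (absences.math + absences.por) / 2,  # average the numeric variable
         paid = paid.math) %>%                           # take 'paid' from Mathematics
  select(id, absences, paid)
```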

See the R code for the exact course of our data wrangling.

You can find the variable descriptions and more information about (and download) the original dataset on this page.

3. Hypotheses on the relationships between alcohol consumption and chosen variables

My chosen variables:

  • age = student’s age (numeric: from 15 to 22)
  • sex = student’s sex (binary: ‘F’ - female or ‘M’ - male)
  • famrel = quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
  • absences = number of school absences (numeric: from 0 to 93)

The original dataset includes information on students from 15 to 22 years of age. This means not all students are of legal drinking age, which seems to be 18 in Portugal. Younger students cannot buy alcohol themselves, which could show in the relationship between age and alcohol use: there should be less alcohol use among younger students.

On average, men are larger than women and their body composition differs, which makes their physiology, and thus their alcohol metabolism, different from women’s. As a result, male tolerance for alcohol is usually higher, which may lead to men drinking more before feeling the effects of alcohol. For this reason I think there might be a relationship between sex and alcohol use.

Students might spend a lot of time at home, both studying and during their free time. Thus, bad family relations could affect study opportunities, motivation and recovery from studying. One reason for bad family relationships may be alcohol itself, perhaps overused by the student themselves or by other family members.

Students who drink more alcohol might have more absences from school. Since weekday alcohol consumption is measured, it could show up as absences during the school week. Friday or Saturday use might not appear as absences in the data, assuming that these schools in Portugal have a regular Monday-to-Friday school week.

My hypotheses:

  • More alcohol use among older students.
  • More alcohol use among male students.
  • More alcohol use among students with worse family relationships.
  • More alcohol use among students with more school absences.

4. Descriptive analyses

# Load needed libraries
library(tidyr)
library(dplyr)
library(ggplot2)
library(GGally)

# glimpse at the data
glimpse(alc); dim(alc)
## Rows: 370
## Columns: 35
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex        <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ age        <dbl> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address    <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
## $ famsize    <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
## $ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu       <dbl> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu       <dbl> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob       <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
## $ Fjob       <chr> "teacher", "other", "other", "services", "other", "other", …
## $ reason     <chr> "course", "course", "other", "home", "home", "reputation", …
## $ guardian   <chr> "mother", "father", "mother", "mother", "father", "mother",…
## $ traveltime <dbl> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime  <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ romantic   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
## $ famrel     <dbl> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime   <dbl> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout      <dbl> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc       <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc       <dbl> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health     <dbl> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ failures   <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ paid       <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes", …
## $ absences   <dbl> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5,…
## $ G1         <dbl> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16,…
## $ G2         <dbl> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16…
## $ G3         <dbl> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 1…
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0,…
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## [1] 370  35
# create a vector of the chosen variables + alcohol use
myvars <- c('age', 'sex', 'famrel', 'absences', 'alc_use', 'high_use')

# pivot data and draw a bar plot of the chosen variables
gather(alc[myvars]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

# draw summary table grouped by sex and low/high use of alcohol consumption
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_age = round(mean(age),1), mean_family_relation_quality = round(mean(famrel),1), mean_absences = round(mean(absences),1))
## # A tibble: 4 × 6
## # Groups:   sex [2]
##   sex   high_use count mean_age mean_family_relation_quality mean_absences
##   <chr> <lgl>    <int>    <dbl>                        <dbl>         <dbl>
## 1 F     FALSE      154     16.6                          3.9           4.3
## 2 F     TRUE        41     16.5                          3.7           6.9
## 3 M     FALSE      105     16.3                          4.1           2.9
## 4 M     TRUE        70     16.9                          3.8           6.1
# draw summary plot matrix stratified for sex as an overview of the data and correlations
p <- ggpairs(alc[myvars], mapping = aes(col=sex, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

  • Absences:
    • Most students have either only a few (0-4) absences or very many (20-44) absences.
    • There is also a portion of students with an average number of absences (median = 12).
    • Female students with high use of alcohol have the most absences.
    • Male students without high alcohol use have the fewest absences.
    • There seems to be a statistically significant positive correlation between alcohol use and absences (R = 0.123) regardless of sex.
  • Age:
    • Students are between 15 and 22, most of them under 19.
    • The correlation analysis implies that there is a statistically significant positive correlation between age and alcohol use among male students.
  • Family relations:
    • Most students seem to have good (4-5 on Likert scale) family relations.
    • Male students who don’t use high amounts of alcohol seem to have the best family relations, and family relations seem to correlate statistically significantly (negatively) with alcohol use.
  • Alcohol use:
    • There are more students who don’t use high amounts of alcohol (n=259).
    • Most of those who don’t use high amounts of alcohol are female students (n=154 vs. n=105).
    • There are more male students who use high amounts of alcohol (n=70) than female students (n=41).
  • Sex:
    • There are more female (n=195) than male students (n=175).
  • My hypotheses:
    • More alcohol use among older students. –> True among male students!
    • More alcohol use among male students. –> True!
    • More alcohol use among students with worse family relationships. –> True among male students!
    • More alcohol use among students with more school absences. –> True!

Nicely hypothesized - or - as Jorma Uotinen might say:

5. Regression analyses

Our outcome variable is dichotomous, or binary, so we will assess its relationship with our chosen variables with a logistic regression model.

Dependent variable:

  • High/low alcohol use (high_use) (dichotomous)

Independent variables:

  • age (numerical)
  • sex (dichotomous)
  • quality of family relations (famrel) (ordinal)
  • number of school absences (absences) (numerical)
library(dplyr); library(ggplot2)

# First, we define our logistic regression model m
# We'll use glm(), a generalized linear model function, so we need to specify the model type to be logistic: family = "binomial", which makes the function use the 'logit' link.
m <- glm(high_use ~ age + sex + famrel + absences, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ age + sex + famrel + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2151  -0.8371  -0.5950   1.0577   2.1545  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.44606    1.75957  -1.958 0.050175 .  
## age          0.17148    0.10306   1.664 0.096143 .  
## sexM         1.09171    0.24864   4.391 1.13e-05 ***
## famrel      -0.32224    0.12913  -2.495 0.012579 *  
## absences     0.08915    0.02298   3.879 0.000105 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 407.03  on 365  degrees of freedom
## AIC: 417.03
## 
## Number of Fisher Scoring iterations: 4
# Compute odds ratios (OR) and their confidence intervals (CI) for each variable
OR <- coef(m) %>% exp %>% round(2)
CI <- confint(m) %>% exp %>% round(2)
# print ORs and CIs as a table
cbind(OR, CI)
##               OR 2.5 % 97.5 %
## (Intercept) 0.03  0.00   0.98
## age         1.19  0.97   1.46
## sexM        2.98  1.84   4.89
## famrel      0.72  0.56   0.93
## absences    1.09  1.05   1.15
  • Sex, family relations and absences seem to have a statistically significant association with high alcohol use (p<0.05).

  • The odds ratio is $e^{\beta}$, i.e. the exponentiated regression coefficient $\beta$.

    • Male students have 2.98 times the odds of using high amounts of alcohol compared to female students. The 95% confidence interval is 1.8-4.9 (entirely above 1), which implies a statistically significant association.
    • Good family relations seem to be associated with less alcohol use: the odds ratio is 0.72 (CI 0.6-0.9, entirely below 1), implying that good family relations are statistically significantly associated with a 28% (1 - 0.72) reduction in the odds of high alcohol use.
    • Absences seem to be statistically significantly associated with high use of alcohol. The increase of school absences by one is associated with an increase of 9% in the odds of high alcohol use.
  • Age has no statistically significant association with high level of alcohol use.

  • My hypotheses:

    • More alcohol use among older students. –> ~~True among male students~~ WRONG!
    • More alcohol use among male students. –> True!
    • More alcohol use among students with worse family relationships. –> True among male students!
    • More alcohol use among students with more school absences. –> True!

In summary, three of my original hypotheses seem to remain valid after our correlation and regression analyses.
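As a quick sanity check (coefficient value copied from the model summary above), exponentiating a coefficient by hand reproduces the reported odds ratio; the ‘absences’ coefficient, for example:

```r
# The 'absences' log-odds coefficient from summary(m) above
beta_absences <- 0.08915
# Exponentiating a logistic regression coefficient gives the odds ratio
round(exp(beta_absences), 2)  # 1.09: each extra absence multiplies the odds of high use by ~1.09
```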

One hypothesis down! Oh no! But no worries…

6. Predictive power of the model

To assess the predictive power of our model we will create a ‘confusion matrix’, some graphs (bar plot, scatter plot and ROC curve), and compute the training error of the model. Finally, we’ll compare the performance of the model with a simple guessing strategy. Later, in part 7 of this report, we’ll test the performance of the model with cross-validation.

As instructed, we’ll focus on the variables that showed statistical significance: sex, famrel and absences. Age had no statistically significant association with the level of alcohol use in our regression model.

# predict() the probability of high_use
# We're using our model 'm' as the object of the predict()
# type = 'response' returns predictions on the scale of the response variable instead of the default link (logit) scale. For a binomial outcome like 'high_use' this means predicted probabilities rather than logit-scale values.
probabilities <- predict(m, type = "response")

library(dplyr)
# add the predicted probabilities to our dataframe 'alc' in a column named 'probability'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
# Let's set our prediction threshold: if the probability is more than 50%, prediction = TRUE, otherwise FALSE.
alc <- mutate(alc, prediction = probabilities >0.5)

# check the last ten original classes, predicted probabilities, and class predictions
select(alc, sex, famrel, absences, high_use, probability, prediction) %>% tail(10)
## # A tibble: 10 × 6
##    sex   famrel absences high_use probability prediction
##    <chr>  <dbl>    <dbl> <lgl>          <dbl> <lgl>     
##  1 M          4        3 FALSE          0.387 FALSE     
##  2 M          4        0 FALSE          0.405 FALSE     
##  3 M          5        7 TRUE           0.437 FALSE     
##  4 F          5        1 FALSE          0.132 FALSE     
##  5 F          4        6 FALSE          0.247 FALSE     
##  6 F          5        2 FALSE          0.165 FALSE     
##  7 F          4        2 FALSE          0.187 FALSE     
##  8 F          1        3 FALSE          0.398 FALSE     
##  9 M          2        4 TRUE           0.568 TRUE      
## 10 M          4        2 TRUE           0.406 FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% addmargins()
##         prediction
## high_use FALSE TRUE Sum
##    FALSE   247   12 259
##    TRUE     76   35 111
##    Sum     323   47 370
# transform to proportions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins() %>% round(2)
##         prediction
## high_use FALSE TRUE  Sum
##    FALSE  0.67 0.03 0.70
##    TRUE   0.21 0.09 0.30
##    Sum    0.87 0.13 1.00

Almost one third (30%) of students drink an amount of alcohol classifiable as ‘high use’, while our model underestimates this by predicting high use for only 13% of students.
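This can be checked directly from the confusion matrix counts printed above:

```r
# Counts copied from the confusion matrix above
obs_high  <- 76 + 35   # observed high users (111)
pred_high <- 12 + 35   # predicted high users (47)
n         <- 370
round(c(observed = obs_high / n, predicted = pred_high / n), 2)  # 0.30 vs 0.13
```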

We can visualize our observations and predictions with a bar plot:

# initialize a plot of 'prediction' (predicted outcomes)
#basic colors
#p2 <- ggplot(data = alc, aes(x=prediction, col=prediction, fill=prediction))
# Custom colors
pred_colors <- c("#FFA500","#00FF00")
p2 <- ggplot(data = alc, aes(x=prediction, fill=prediction)) +
  scale_fill_manual(values=pred_colors)

# draw a bar plot of high_use by 'high use' (observations)
# delineate outcome variable labels
## define new labels
outlab <- c('True negative','True positive')
## Define old labels in same order
names(outlab) <- c('FALSE','TRUE')
p2 + geom_bar() + facet_wrap("high_use", labeller=labeller(high_use=outlab))

There are more FALSE observations (true negatives = no high alcohol use, left panel) than TRUE observations (true positives = high alcohol use, right panel). The model predicts the true negatives better than the true positives.

This can be presented also with a sleeker scatter plot:

# initialize a plot of 'high_use' versus 'probability' in 'alc'
p3 <- ggplot(alc, aes(x = probability, y = high_use, col=prediction))

# define the geom as points and draw the plot
p3 + geom_point()


We can also visualize the confusion matrix (sensitivity and specificity) with a ROC (receiver operating characteristic) curve. It shows the performance of our classification model in terms of the true positive and false positive rates, and helps us assess how well our logistic regression model fits the data.

  • Sensitivity = the probability that the model correctly identifies a positive observation (= TPR)
  • Specificity = the probability that the model correctly identifies a negative observation (= 1 - FPR)

The ROC curve is constructed by plotting the true positive rate (TPR) against the false positive rate (FPR).

  • TPR = TP/(TP+FN)
  • FPR = FP/(TN+FP)

The closer the curve comes to the upper left corner of the plot, the more accurate the test. The closer the curve gets to the grey 45° line, the less accurate the test.
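With the counts from our confusion matrix, the single point our 0.5-threshold classifier occupies on the ROC curve can be computed by hand:

```r
# Counts from the confusion matrix above (0.5 threshold)
tp <- 35; fn <- 76; fp <- 12; tn <- 247
tpr <- tp / (tp + fn)   # sensitivity
fpr <- fp / (tn + fp)   # 1 - specificity
round(c(TPR = tpr, FPR = fpr), 2)  # 0.32 and 0.05
```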

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Define the curve with function roc(response, predictor), see ?roc() and pROC documentation (Resources)
roc_obj <- roc(alc$high_use, alc$probability)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
# Draw the plot
plot(roc_obj)

We can see that the curve departs from the 45° line, which means that our model seems to have at least some predictive value.
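The curve can also be condensed into one number, the area under the curve (AUC), which pROC reports with auc(roc_obj). Conceptually the AUC is the probability that a randomly chosen positive case gets a higher predicted probability than a randomly chosen negative case; here is a tiny base-R illustration on toy values (illustrative numbers, not our model’s data):

```r
# Toy responses and predicted probabilities (illustrative values only)
resp <- c(0, 0, 1, 1)
pred <- c(0.1, 0.4, 0.35, 0.8)
# AUC = share of positive/negative pairs ranked correctly (ties count as 0.5)
wins <- outer(pred[resp == 1], pred[resp == 0], `>`)
ties <- outer(pred[resp == 1], pred[resp == 0], `==`)
mean(wins + 0.5 * ties)  # 0.75
```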

Training error

To calculate the training error (= the total proportion of incorrectly classified individuals) of our model, we’ll use the loss function (mean prediction error) presented in our course exercises, loss_func(response, probability). It takes a binary response variable (‘high_use’) and each individual’s predicted probability of being TRUE.

  • class is either TRUE (1) or FALSE (0) for our variable of interest (‘high_use’)
  • prob (float between 0 and 1) is the individual probability that they are classified as TRUE for ‘high_use’

abs() gives the absolute value of the difference between class and prob; if it is greater than 0.5, the prediction was wrong.

The function therefore assigns TRUE (= 1) when the prediction was false and FALSE (= 0) when it was correct (note the inversion: TRUE marks a wrong prediction!). It then calculates the mean prediction error, i.e. the number of wrong predictions divided by the total number of observations.

\[ \frac{(n_{FP} + n_{FN}) \cdot 1 + (n_{TP} + n_{TN}) \cdot 0}{N} \]

The function:

# define the loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  # print(n_wrong) #print this if you want to see the logicals the final result mean() is based on
  mean(n_wrong)
}


# call loss_func() to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2378378

A simple guessing strategy might be guessing that no student drinks enough to be classified as ‘high use’ (probability(high_use=TRUE) = 0), that all students drink very much (probability(high_use=TRUE) = 1), or that every other student does (probability(high_use=TRUE) = 0.5). These strategies can be tested with loss_func() by supplying a constant individual probability (prob), as below.

# No students' drinking habits are 'high use', high_use=FALSE, probability(high_use=TRUE)= 0
loss_func(class = alc$high_use, prob = 0)
## [1] 0.3
# All students' drinking habits are 'high use', high_use=TRUE, probability(high_use=TRUE)= 1
loss_func(class = alc$high_use, prob = 1)
## [1] 0.7
# Every other student's drinking habits are 'high use', probability(high_use=TRUE) ≈ 0.5 (0.49 and 0.51 used, see the note below)
loss_func(class = alc$high_use, prob = 0.49)
## [1] 0.3
loss_func(class = alc$high_use, prob = 0.51)
## [1] 0.7
# Every third student's drinking habits are 'high use', high_use=TRUE, probability(high_use=TRUE)= 0.3
loss_func(class = alc$high_use, prob = 0.3)
## [1] 0.3

The training error is 23.8%, which means that the model’s accuracy (1 - training error) is approximately 76%. Simply guessing how many students’ drinking habits are classified as ‘high use’ never reached a higher accuracy:

| Guess classified as ‘high use’ | Guessed probability of ‘high use’ drinking | Training error | Accuracy |
|---|---|---|---|
| No one drinks | 0 | 0.3 | 70% |
| Everybody drinks | 1 | 0.7 | 30% |
| Every other drinks* | 0.49 & 0.51 | 0.3 & 0.7 | 70% & 30% |
| Every third drinks | 0.3 | 0.3 | 70% |

OBS! *The exact every-other case (prob = 0.50) can’t be assessed because the loss function uses a strict > 0.5 threshold, so close approximations were used instead.
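The 0.5 corner case is easy to demonstrate: with prob = 0.5 the absolute difference is exactly 0.5 for both classes and never strictly greater, so nothing is ever counted as wrong:

```r
# Same loss function as defined above
loss_func <- function(class, prob) {
  mean(abs(class - prob) > 0.5)
}
loss_func(class = c(TRUE, FALSE, TRUE), prob = 0.5)   # 0: no prediction counted as wrong
loss_func(class = c(TRUE, FALSE, TRUE), prob = 0.51)  # 1/3: the FALSE case now counts as wrong
```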

In summary, our model seems to be better in predicting the level of alcohol use of our students than simple guessing.

7. Bonus: 10-fold cross-validation

Quoted from our course material: Cross-validation is a method of testing a predictive model on unseen data. In cross-validation, the value of a penalty (loss) function (mean prediction error) is computed on data not used for finding the model. Low value = good. Cross-validation gives a good estimate of the actual predictive power of the model. It can also be used to compare different models or classification methods.

# compute the average number of wrong predictions in the (training) data
loss_func(alc$high_use, alc$probability)
## [1] 0.2378378
# K-fold cross-validation
library(boot)
cvK10 <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cvK10$delta[1]
## [1] 0.2405405

The mean prediction error from the cross-validation rounds on the testing data (0.241) is slightly higher than on the training data (0.238).

The model introduced in our Exercise set 3 had about 0.25 mean prediction error on the 10-fold cross validation on the testing data which implies that our new model (variables: sex, famrel, absences) has a similar test set performance to the one used in the Exercise set (variables: sex, failures, absences).

8. Super-Bonus: Comparison of logistic regression models with different sets of predictors

We’ll compare 11 different testing models, reducing the number of predictors step by step while always keeping the most significant variables (first dropping the five least significant variables at a time, later excluding only one variable per model).

Defining the models (logistic regression) and assigning predictions based on prediction probabilities

# check all possible variables
# -> 31 possible independent variables (we'll omit Daily and Weekend alcohol use since they were used for composing our outcome variable!)
colnames(alc)

# models and summaries
# detecting the least statistically significant variables
m1 <- glm(high_use ~  school + sex + age + address + famsize + Pstatus + Medu + Fedu + Mjob + Fjob + reason + guardian + traveltime + studytime + schoolsup + famsup + activities + nursery + higher + internet + romantic + famrel + freetime + goout + health + failures + paid + absences + G1 + G2 + G3 , data = alc, family ="binomial")
summary(m1)

piis <- summary(m1)[12] # element 12 of a glm summary is the coefficient table (clearer: summary(m1)["coefficients"])
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order: goout + famrel + absences + reason + sex + guardian + address + activities + freetime + Mjob + health + paid + famsize + romantic + Fjob + studytime + traveltime + nursery + failures + G1 + school + Fedu + G2 + age + Pstatus + G3 + higher + internet + famsup + Medu + schoolsup

m2 <- glm(high_use ~  goout + famrel + absences + reason + sex + guardian + address + activities + freetime + Mjob + health + paid + famsize + romantic + Fjob + studytime + traveltime + nursery + failures + G1 + school + Fedu + G2 + age + Pstatus + G3, data = alc, family ="binomial")
summary(m2)

piis <- summary(m2)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order:  goout + absences + famrel + reason + sex + guardian + address + activities + freetime + Mjob + paid + health + famsize + studytime + traveltime + nursery + romantic + failures + G1 + school + Fjob + Fedu + G2 + age + Pstatus + G3

m3 <- glm(high_use ~  goout + absences + famrel + reason + sex + guardian + address + activities + paid + Mjob + health + freetime + famsize + romantic + studytime + Fjob + traveltime + nursery + school + failures + G1, data = alc, family ="binomial")
summary(m3)

piis <- summary(m3)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order:  goout + absences + famrel + reason + sex + guardian + address + activities + paid + Mjob + health + freetime + famsize + romantic + studytime + Fjob + traveltime + nursery + school + failures + G1


m4 <- glm(high_use ~  goout + absences + famrel + reason + sex + guardian + address + activities + paid + Mjob + health + freetime + famsize + romantic + studytime + Fjob, data = alc, family ="binomial")
summary(m4)

piis <- summary(m4)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order:  goout + absences + famrel + sex + reason + address + guardian + health + Mjob + studytime + activities + paid + famsize + romantic + Fjob + freetime


m5 <- glm(high_use ~  goout + absences + famrel + sex + reason + address + guardian + health + Mjob + studytime + activities, data = alc, family ="binomial")
summary(m5)

piis <- summary(m5)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order:  goout + absences + famrel + sex + reason + guardian + address + activities + studytime + health + Mjob


m6 <- glm(high_use ~  goout + absences + famrel + sex + reason + guardian, data = alc, family ="binomial")
summary(m6)

piis <- summary(m6)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order:  goout + sex + absences + famrel + reason + guardian


m7 <- glm(high_use ~  goout + sex + absences + famrel + reason, data = alc, family ="binomial")
summary(m7)

piis <- summary(m7)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order:  goout + sex + absences + famrel + reason


m8 <- glm(high_use ~  goout + sex + absences + famrel, data = alc, family ="binomial")
summary(m8)

piis <- summary(m8)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order:  goout + sex + absences + famrel


m9 <- glm(high_use ~  goout + sex + absences, data = alc, family ="binomial")
summary(m9)

piis <- summary(m9)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order:  goout + sex + absences


m10 <- glm(high_use ~  goout + sex, data = alc, family ="binomial")
summary(m10)

piis <- summary(m10)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order:  goout + sex 


m11 <- glm(high_use ~  goout, data = alc, family ="binomial")
summary(m11)



# predict() the probability of high_use
probabilities1 <- predict(m1, type = "response")
probabilities2 <- predict(m2, type = "response")
probabilities3 <- predict(m3, type = "response")
probabilities4 <- predict(m4, type = "response")
probabilities5 <- predict(m5, type = "response")
probabilities6 <- predict(m6, type = "response")
probabilities7 <- predict(m7, type = "response")
probabilities8 <- predict(m8, type = "response")
probabilities9 <- predict(m9, type = "response")
probabilities10 <- predict(m10, type = "response")
probabilities11 <- predict(m11, type = "response")

# add the predicted probabilities to our dataframe 'alc' in a column named 'probability'
alc <- mutate(alc, probability1 = probabilities1)
alc <- mutate(alc, probability2 = probabilities2)
alc <- mutate(alc, probability3 = probabilities3)
alc <- mutate(alc, probability4 = probabilities4)
alc <- mutate(alc, probability5 = probabilities5)
alc <- mutate(alc, probability6 = probabilities6)
alc <- mutate(alc, probability7 = probabilities7)
alc <- mutate(alc, probability8 = probabilities8)
alc <- mutate(alc, probability9 = probabilities9)
alc <- mutate(alc, probability10 = probabilities10)
alc <- mutate(alc, probability11 = probabilities11)

# use the probabilities to make a prediction of high_use
# Let's set our prediction treshold: If the probability is more than 50%, prediction = TRUE, otherwise it's FALSE.
alc <- mutate(alc, prediction1 = probabilities1 >0.5)
alc <- mutate(alc, prediction2 = probabilities2 >0.5)
alc <- mutate(alc, prediction3 = probabilities3 >0.5)
alc <- mutate(alc, prediction4 = probabilities4 >0.5)
alc <- mutate(alc, prediction5 = probabilities5 >0.5)
alc <- mutate(alc, prediction6 = probabilities6 >0.5)
alc <- mutate(alc, prediction7 = probabilities7 >0.5)
alc <- mutate(alc, prediction8 = probabilities8 >0.5)
alc <- mutate(alc, prediction9 = probabilities9 >0.5)
alc <- mutate(alc, prediction10 = probabilities10 >0.5)
alc <- mutate(alc, prediction11 = probabilities11 >0.5)

The ordering of the variables by p-value changed between regressions. However, the ‘top’ variables remained significant and kept their top positions in the summaries throughout. The less significant variables changed rankings often, and thus they were the ones primarily discarded from the model.

Nota bene!

I didn’t get my for-loops nor dynamic functions to work so part 8 scripts have been and will continue being ugly…sorry! But don’t worry, since…

Kittycat suffering from ugly-scriptitis
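For the record, a looped version is within reach. Here is a hedged sketch using the built-in mtcars data (with am as a stand-in binary outcome) so it runs as-is; with the course data one would swap in alc, high_use and the 11 model formulas above:

```r
library(boot)

# Same loss function as earlier in this chapter
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# Candidate formulas (illustrative; replace with the 11 model formulas above)
formulas <- list(am ~ wt, am ~ wt + hp)

set.seed(123)  # make the K-fold splits reproducible
errs <- sapply(formulas, function(f) {
  m <- glm(f, data = mtcars, family = "binomial")
  c(train = loss_func(mtcars$am, predict(m, type = "response")),
    cv    = cv.glm(mtcars, cost = loss_func, glmfit = m, K = 5)$delta[1])
})
errs  # one column per model, rows 'train' and 'cv'
```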

Training and testing errors by model

# Models numbered by amount of variables used
modelss <- c(31,26,21,16,11,6,5,4,3,2,1)

# call loss_func() to compute the average number of wrong predictions in the (training) data
pred_train <- loss_func(class = alc$high_use, prob = alc$probability1)
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability2))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability3))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability4))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability5))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability6))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability7))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability8))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability9))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability10))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability11))

#10-fold cross validation on testing data
cvK10_ss1 <- cv.glm(data = alc, cost = loss_func, glmfit = m1, K = 10)
cvK10_ss2 <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)
cvK10_ss3 <- cv.glm(data = alc, cost = loss_func, glmfit = m3, K = 10)
cvK10_ss4 <- cv.glm(data = alc, cost = loss_func, glmfit = m4, K = 10)
cvK10_ss5 <- cv.glm(data = alc, cost = loss_func, glmfit = m5, K = 10)
cvK10_ss6 <- cv.glm(data = alc, cost = loss_func, glmfit = m6, K = 10)
cvK10_ss7 <- cv.glm(data = alc, cost = loss_func, glmfit = m7, K = 10)
cvK10_ss8 <- cv.glm(data = alc, cost = loss_func, glmfit = m8, K = 10)
cvK10_ss9 <- cv.glm(data = alc, cost = loss_func, glmfit = m9, K = 10)
cvK10_ss10 <- cv.glm(data = alc, cost = loss_func, glmfit = m10, K = 10)
cvK10_ss11 <- cv.glm(data = alc, cost = loss_func, glmfit = m11, K = 10)


# average number of wrong predictions in the cross validation
cvK10$delta[1]
## [1] 0.2405405
cvK10_ss <- cvK10_ss1$delta[1]
cvK10_ss <- append(cvK10_ss,cvK10_ss2$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss3$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss4$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss5$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss6$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss7$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss8$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss9$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss10$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss11$delta[1])


# create a tibble
supbon <- tibble(modelss, pred_train, cvK10_ss)

Trends of both training and testing errors by the number of predictors in the model

# Pivot data (pivot_longer() comes from the tidyr package)
library(tidyr)
longdf <- pivot_longer(supbon, c('pred_train','cvK10_ss'))

# initialize and draw a plot
ggplot(data=longdf, aes(x=modelss, y=value, col=name)) + geom_point() + geom_line() +
  # Edit legend title and labels
  scale_color_discrete(name = "Mean prediction error", labels = c("testing", "training")) +
  # Edit label names and title
  labs(x="Model", title='Trends of training and testing errors by the number of predictors in the model')

The mean prediction error is:

  • always lower in the training set.
  • very high in both the testing and training set when the model has only 1-2 predictors.
  • quite low in both the testing and training set when the model has 3-5 predictors.
  • quite high in both sets when there are 6-16 predictors in the model.
  • low in the training set and very high in the testing set when there are 21-31 predictors in the model.

\(\rightarrow\) The predictive logistic regression model is at its best with neither too few nor too many variables \(\rightarrow\) 3-5 variables seems like a good option for model construction.

✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ Done!

Ch4 Clustering and classification

Describe the work you have done this week and summarize your learning.

  • Describe your work and results clearly.
  • Assume the reader has an introductory course level understanding of writing and reading R code as well as statistical methods.
  • Assume the reader has no previous knowledge of your data or the more advanced methods you are using.
date()
## [1] "Sun Nov 27 23:31:18 2022"

Here we go again…

Libraries used for Ch4

Obs! You might need to install some packages first if you haven’t used them before (see install.packages()).

#install.packages(c("tidyverse","GGally","readr", "MASS", "corrplot"))

Load the needed R packages before getting to work:

#load required packages
library(tidyverse)
library(GGally)
library(dplyr)
library(ggplot2)
library(MASS)
library(corrplot)

Data analysis (15p.)

2. Loading and describing the data ‘Boston’

# load the 'Boston' data
data("Boston")

# explore the dataset
str(Boston);dim(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## [1] 506  14

The ‘Boston’ dataset from the MASS library is a data frame of Housing Values in Suburbs of Boston. It has 506 observations (rows) and 14 variables (columns).

The variables are:

  • crim = per capita crime rate by town.
  • zn = proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus = proportion of non-retail business acres per town.
  • chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox = nitrogen oxides concentration (parts per 10 million).
  • rm = average number of rooms per dwelling.
  • age = proportion of owner-occupied units built prior to 1940.
  • dis = weighted mean of distances to five Boston employment centres.
  • rad = index of accessibility to radial highways.
  • tax = full-value property-tax rate per $10,000.
  • ptratio = pupil-teacher ratio by town.
  • black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
  • lstat = lower status of the population (percent).
  • medv = median value of owner-occupied homes in $1000s.

The R documentation of the dataset can be found here.

3. A graphical overview and summary of the data

# summary of the data
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# data from wide to long
long_Boston <- pivot_longer(Boston, cols=everything(), names_to = 'variable', values_to = 'value')

# Histograms of all variables in use
p1 <- ggplot(data=long_Boston)
p1 + geom_histogram(mapping= aes(x=value)) +
  facet_wrap(~variable, scales="free")

# Summary plot matrix with correlation and density plots
p2 <- ggpairs(Boston, mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p2

All variables in the ‘Boston’ dataset are numeric. Most of them are skewed, so we’ll use Spearman’s correlation for the correlation assessment later. Several variables have possible outliers. With this many variables the correlations are hard to read from this plot, and ggpairs() shows Pearson’s correlations by default, so we’ll draw another plot focusing on the correlations between variables, computing the matrix with cor() and visualizing it with the ‘corrplot’ library.

# calculate the correlation matrix
cor_matrix <- cor(Boston, method='spearman') 

# print the correlation matrix
cor_matrix %>% round(1)
##         crim   zn indus chas  nox   rm  age  dis  rad  tax ptratio black lstat
## crim     1.0 -0.6   0.7  0.0  0.8 -0.3  0.7 -0.7  0.7  0.7     0.5  -0.4   0.6
## zn      -0.6  1.0  -0.6  0.0 -0.6  0.4 -0.5  0.6 -0.3 -0.4    -0.4   0.2  -0.5
## indus    0.7 -0.6   1.0  0.1  0.8 -0.4  0.7 -0.8  0.5  0.7     0.4  -0.3   0.6
## chas     0.0  0.0   0.1  1.0  0.1  0.1  0.1 -0.1  0.0  0.0    -0.1   0.0  -0.1
## nox      0.8 -0.6   0.8  0.1  1.0 -0.3  0.8 -0.9  0.6  0.6     0.4  -0.3   0.6
## rm      -0.3  0.4  -0.4  0.1 -0.3  1.0 -0.3  0.3 -0.1 -0.3    -0.3   0.1  -0.6
## age      0.7 -0.5   0.7  0.1  0.8 -0.3  1.0 -0.8  0.4  0.5     0.4  -0.2   0.7
## dis     -0.7  0.6  -0.8 -0.1 -0.9  0.3 -0.8  1.0 -0.5 -0.6    -0.3   0.2  -0.6
## rad      0.7 -0.3   0.5  0.0  0.6 -0.1  0.4 -0.5  1.0  0.7     0.3  -0.3   0.4
## tax      0.7 -0.4   0.7  0.0  0.6 -0.3  0.5 -0.6  0.7  1.0     0.5  -0.3   0.5
## ptratio  0.5 -0.4   0.4 -0.1  0.4 -0.3  0.4 -0.3  0.3  0.5     1.0  -0.1   0.5
## black   -0.4  0.2  -0.3  0.0 -0.3  0.1 -0.2  0.2 -0.3 -0.3    -0.1   1.0  -0.2
## lstat    0.6 -0.5   0.6 -0.1  0.6 -0.6  0.7 -0.6  0.4  0.5     0.5  -0.2   1.0
## medv    -0.6  0.4  -0.6  0.1 -0.6  0.6 -0.5  0.4 -0.3 -0.6    -0.6   0.2  -0.9
##         medv
## crim    -0.6
## zn       0.4
## indus   -0.6
## chas     0.1
## nox     -0.6
## rm       0.6
## age     -0.5
## dis      0.4
## rad     -0.3
## tax     -0.6
## ptratio -0.6
## black    0.2
## lstat   -0.9
## medv     1.0
# visualize the correlation matrix
# cl.pos= position of the color legend, e.g. cl.pos = 'b'
# tl.pos= position of the text labels
# tl.cex = size of text labels
# type = type of plot, 'full', 'upper', 'lower'
# method = the visualization of the correlation coefficients, 'circle', 'square', 'pie', 'number', etc.
# Basic one-style correlation plot: corrplot(cor_matrix, method="ellipse", tl.pos ="d", tl.cex=0.6)
# Cool mixed correlation plot:
corrplot.mixed(cor_matrix, lower = 'number', upper = 'ellipse', tl.pos = "lt", tl.col = 'black', tl.cex = 0.6, number.cex = 0.5)

The darker the color (blue or red; see the color legend) and the narrower (more line-like) the ellipse, the larger the absolute correlation coefficient between the variables. The exact coefficients are shown in the lower triangle of the plot. The slope and the color (blue vs. red) of the ellipses depict the direction (positive vs. negative) of the correlation. Unfortunately, cor() does not provide p-values for the correlations.
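Although cor() gives no p-values, base R’s cor.test() can supply one for a single pair. A minimal sketch, using the nox–dis pair from the matrix above (exact = FALSE avoids the ties warning that Spearman’s method raises on this data):

```r
# significance test for one correlation pair with base R cor.test()
library(MASS)  # provides the Boston data
data("Boston")

ct <- cor.test(Boston$nox, Boston$dis, method = "spearman", exact = FALSE)
ct$estimate  # Spearman's rho for nox vs. dis (about -0.9, as in the matrix)
ct$p.value   # p-value for the null hypothesis of no correlation
```

Repeating this over all pairs (e.g. with corrplot’s cor.mtest() helper) would give a full p-value matrix to accompany the plot.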

There seem to be some strong negative correlations (= more of A correlates with less of B) between:

  • nitrogen oxide concentration (nox) and the weighted mean of distances to five Boston employment centres (dis).
  • percent of lower status of the population (lstat) and the median value of owner-occupied homes in $1000s (medv).
  • proportion of owner-occupied units built prior to 1940 (age) and the weighted mean of distances to five Boston employment centres (dis).
  • proportion of non-retail business acres per town (indus) and the weighted mean of distances to five Boston employment centres (dis).
  • per capita crime rate by town (crim) and the weighted mean of distances to five Boston employment centres (dis).

There are some strong positive correlations (= more of A correlates with more of B) too:

  • nitrogen oxide concentration (nox) and the proportion of non-retail business acres per town (indus).
  • nitrogen oxide concentration (nox) and per capita crime rate by town (crim)
  • nitrogen oxide concentration (nox) and proportion of owner-occupied units built prior to 1940 (age)
  • index of accessibility to radial highways (rad) and full-value property-tax rate per $10,000 (tax).
  • full-value property-tax rate per $10,000 (tax) and per capita crime rate by town (crim)

In summary, some examples of possible correlations:

  • closer mean distance to five Boston employment centres seems to be correlated with higher
    • nitrogen oxide concentration,
    • proportion of owner-occupied units built prior to 1940,
    • proportion of non-retail business acres, and
    • per capita crime rate
  • higher per capita crime rate seems to be correlated with
    • shorter mean distance to five Boston employment centres
    • higher nitrogen oxide concentration
    • higher full-value property-tax rate
  • a higher percentage of lower-status population seems to be correlated with a lower median value of owner-occupied homes.

4. Data standardization and division into training and test sets

We’ll now standardize the data by scaling it. The Boston data has only numerical variables, so we can standardize them all in one go with scale().

Quoting our course material: “In the scaling we subtract the column means from the corresponding columns and divide the difference with the standard deviation.”

\[scaled(x) = \frac{x - mean(x)}{sd(x)}\]

# center and standardize variables with scale()
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame for further analysis
boston_scaled <- as.data.frame(boston_scaled)

After scaling, the mean of every variable is standardized to zero. Scaling variables that originally had different scales makes their comparison and joint analysis possible. The scale() function returns a matrix (and array), so for further analysis we converted it back to a data frame.
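As a quick sanity check of the formula above, any scaled column can be reproduced by hand. A minimal sketch, assuming Boston and boston_scaled exist as created above:

```r
# scale() should match the manual (x - mean(x)) / sd(x) formula
manual_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(manual_crim, boston_scaled$crim)  # TRUE if the two agree
```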

Let’s replace the continuous (scaled) crime rate variable with a categorical variable with levels low, medium low, medium high and high, using the quantiles as break points.

# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

#check new dataset
summary(boston_scaled)
##        zn               indus              chas              nox         
##  Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723   Min.   :-1.4644  
##  1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723   1st Qu.:-0.9121  
##  Median :-0.48724   Median :-0.2109   Median :-0.2723   Median :-0.1441  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723   3rd Qu.: 0.5981  
##  Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648   Max.   : 2.7296  
##        rm               age               dis               rad         
##  Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658   Min.   :-0.9819  
##  1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049   1st Qu.:-0.6373  
##  Median :-0.1084   Median : 0.3171   Median :-0.2790   Median :-0.5225  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617   3rd Qu.: 1.6596  
##  Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566   Max.   : 1.6596  
##       tax             ptratio            black             lstat        
##  Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033   Min.   :-1.5296  
##  1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049   1st Qu.:-0.7986  
##  Median :-0.4642   Median : 0.2746   Median : 0.3808   Median :-0.1811  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332   3rd Qu.: 0.6024  
##  Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406   Max.   : 3.5453  
##       medv              crime    
##  Min.   :-1.9063   low     :127  
##  1st Qu.:-0.5989   med_low :126  
##  Median :-0.1449   med_high:126  
##  Mean   : 0.0000   high    :127  
##  3rd Qu.: 0.2683                 
##  Max.   : 2.9865

We’ll now divide the dataset into training (80%) and test (20%) sets for further model fitting and testing.

# number of rows in the scaled Boston dataset 
n <- nrow(boston_scaled)
n
## [1] 506
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set (random 80% of the scaled Boston data)
train <- boston_scaled[ind,]

# create test set (random 20% of the scaled Boston data)
test <- boston_scaled[-ind,]

5. Fitting of linear discriminant analysis (LDA) on the training set

We’ll fit the linear discriminant analysis (LDA), a classification (and dimension reduction) method, on the training set.

  • Target variable: crime (categorical)
  • Predictor variables: zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv
# linear discriminant analysis on the training set
# target variable = crime
# all other variables (.) are the predictor variables
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2623762 0.2500000 0.2425743 0.2450495 
## 
## Group means:
##                   zn      indus         chas        nox          rm        age
## low       0.90687350 -0.9518716 -0.123759247 -0.8759950  0.39725807 -0.8691459
## med_low  -0.06632256 -0.3155376 -0.077423115 -0.5789052 -0.09872668 -0.3708941
## med_high -0.38923530  0.1585206  0.249939331  0.3910602  0.13378305  0.4089627
## high     -0.48724019  1.0149946  0.006051757  1.0154551 -0.39793220  0.8130601
##                 dis        rad        tax     ptratio       black        lstat
## low       0.9140276 -0.6752522 -0.7320004 -0.42480724  0.37837408 -0.756513372
## med_low   0.4128068 -0.5429521 -0.4972876 -0.08967776  0.33471164 -0.144859329
## med_high -0.3392582 -0.3701367 -0.2925681 -0.28818321  0.07103188 -0.003258244
## high     -0.8496350  1.6596029  1.5294129  0.80577843 -0.72950646  0.878822375
##                  medv
## low       0.507053095
## med_low   0.008167187
## med_high  0.164853334
## high     -0.673408969
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.10563213  0.606964027 -0.86110143
## indus    0.03204997 -0.400915041  0.52123440
## chas    -0.06234528 -0.083629277  0.01191656
## nox      0.24040856 -0.739897209 -1.28217936
## rm      -0.09747270 -0.158956217 -0.11818800
## age      0.35648775 -0.316573943 -0.28636033
## dis     -0.08072341 -0.273751580  0.15579154
## rad      3.16834975  0.966989084 -0.17763390
## tax      0.06591010  0.033595240  0.45604731
## ptratio  0.15480657 -0.007742879 -0.18847941
## black   -0.16309977 -0.010002950  0.15137491
## lstat    0.18397276 -0.220960096  0.55167946
## medv     0.17529286 -0.331569625 -0.09036433
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9520 0.0361 0.0118

There are three discriminant functions (LD1, LD2, LD3). Let’s biplot the results of these functions:

# the function for lda biplot arrows (variables)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results, all dimensions
plot(lda.fit, dimen = 3, col = classes, pch = classes)

# plot LD1 vs. LD2
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

LD1 and LD2 are the strongest functions to separate our observations on the target variable (as we can also see from the proportion of trace on the summary of the LDA model output and the graphical overview of all the dimensions LD1 through LD3).

The graphical overview of our LDA model shows that LD1 and LD2 clearly separate groups of observations: the high per capita crime rate category (from our target variable, crime) is separated from the other, lower quartiles. The crime rate seems to be higher when the index of accessibility to radial highways is higher. Other variables don’t stand out as much.
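As a side note, the proportion of trace reported in the model summary can be recomputed from the singular values stored in the fitted object. A small sketch, assuming lda.fit from above:

```r
# proportion of trace = squared singular values, normalized to sum to 1
prop_trace <- lda.fit$svd^2 / sum(lda.fit$svd^2)
round(prop_trace, 4)  # should match the 'Proportion of trace' line above
```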

6. Predicting with the LDA model on the testing set

To predict crime rate categories with our LDA model, we first save the true crime categories from the test set, then remove the crime variable from the test data, and finally predict the crime rate classes.

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable (true/correct crime rate classes) from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results %>% add sums
tbres1 <- table(correct = correct_classes, predicted = lda.pred$class) %>% addmargins()
tbres1
##           predicted
## correct    low med_low med_high high Sum
##   low       16       4        1    0  21
##   med_low    5      15        5    0  25
##   med_high   1      10       17    0  28
##   high       0       0        1   27  28
##   Sum       22      29       24   27 102
#number of rows in test set
nrow(test)
## [1] 102

Our model seems to predict crime rates quite nicely:

  • 76.2 % of low crime rate predicted correctly.
  • 60 % of medium low crime rate predicted correctly.
  • 60.7 % of medium high crime rate predicted correctly.
  • 96.4 % of high crime rate predicted correctly.
  • In total, 73.5 % of all 102 crime rate observations predicted correctly.
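The percentages above can also be computed directly from the predictions rather than by hand. A short sketch, assuming correct_classes and lda.pred from above:

```r
# overall accuracy: share of correctly predicted classes
mean(correct_classes == lda.pred$class)

# per-class accuracy: diagonal of the confusion matrix over the row sums
cm <- table(correct = correct_classes, predicted = lda.pred$class)
round(diag(cm) / rowSums(cm), 3)
```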

7. Model optimization with k-means algorithm

We want to investigate the similarity between our observations, so we’ll create clusters of similar data points by running the k-means algorithm on the dataset.

We’ll first standardize the data for further analysis.

# Reload the 'Boston' data
data("Boston")

# Recenter and restandardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame for further analysis
boston_scaled <- as.data.frame(boston_scaled)

As instructed in our assignment, we’ll calculate distances between observations to assess the similarity of our data points. We’ll calculate both Euclidean and Manhattan distances. It is worth noting that the k-means algorithm itself uses Euclidean distances.

# Calculate distances between the observations.
# euclidean distance matrix
dist_eu <- dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled, method="manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
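As a sanity check of what dist() computes, the first entry of the Euclidean distance matrix can be reproduced by hand. A sketch, assuming boston_scaled and dist_eu from above (dist() stores the lower triangle column-wise, so its first entry is the distance between observations 1 and 2):

```r
# Euclidean distance between observations 1 and 2, computed manually
d12 <- sqrt(sum((boston_scaled[1, ] - boston_scaled[2, ])^2))
all.equal(as.numeric(dist_eu)[1], d12)
```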
## Run k-means algorithm on the dataset
## k-means clustering
# K-means might produce different results every time, because it randomly assigns the initial cluster centers. The function `set.seed()` can be used to deal with that.
set.seed(123)
km <- kmeans(boston_scaled, centers = 4)

# plot the scaled Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

pairs(boston_scaled[6:10], col = km$cluster)

With k-means clustering using 4 clusters, there seems to be some overlap between clusters (based on the plots). Let’s investigate the optimal number of clusters for our model by examining how the total within sum of squares (WCSS) changes as the number of clusters changes.

# Investigate what is the optimal number of clusters and run the algorithm again
# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line', ylab='Total within sum of squares (WCSS)', xlab='number of clusters')

The optimal number of clusters is indicated where the total WCSS drops sharply (the “elbow”). Here, the optimal number of clusters for our model seems to be 2. Let’s run the algorithm again with 2 clusters and visualize the results.

# k-means clustering
km <- kmeans(boston_scaled, centers = 2)


# Visualize the clusters
# plot the scaled Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

pairs(boston_scaled[1:5], col = km$cluster)

pairs(boston_scaled[6:10], col = km$cluster)

pairs(boston_scaled[11:14], col = km$cluster)

We determined that grouping our data with k-means clustering is best done with two centroids, i.e., two clusters. However, we do not know what these clusters actually represent.

From the plots we can see that there are very clear, separate clusters at least for rad and tax.

To determine possible reasons for the clustering result, we could draw a parallel coordinates plot.
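One hedged sketch of such a plot, using parcoord() from the MASS package (already loaded for the Boston data) and coloring each line by the 2-cluster k-means assignment from above:

```r
# parallel coordinates plot: one line per suburb, colored by cluster
library(MASS)
parcoord(boston_scaled, col = km$cluster, var.label = TRUE)
```

Lines of the same color that follow similar paths across the axes would suggest which variables drive the cluster separation.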

Bonus: K-means and biplot visualization

# Reload the 'Boston' data
data("Boston")

# Recenter and restandardize variables
boston_scaled_bonus <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled_bonus)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled_bonus)
## [1] "matrix" "array"
# change the object to data frame for further analysis
boston_scaled_bonus <- as.data.frame(boston_scaled_bonus)

Let’s run the k-means algorithm with an arbitrary choice of four clusters.

## Run k-means algorithm with 4 cluster centers on the dataset

## k-means clustering
# K-means may produce different results on each run, because it randomly assigns the initial cluster centers. The function `set.seed()` can be used to deal with that (already called in a previous chunk of code in this document).
km_b <- kmeans(boston_scaled_bonus, centers = 4)

# plot the scaled Boston dataset with clusters
#pairs(Boston, col = km$cluster)
#pairs(Boston[6:10], col = km$cluster)

#### Perform LDA using the clusters as target classes

# assess the kmeans() result summary
summary(km_b) #it is not a dataframe
##              Length Class  Mode   
## cluster      506    -none- numeric
## centers       56    -none- numeric
## totss          1    -none- numeric
## withinss       4    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           4    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
typeof(km_b) #it is a 'list' object
## [1] "list"
str(km_b)
## List of 9
##  $ cluster     : Named int [1:506] 1 1 1 2 2 1 1 1 1 1 ...
##   ..- attr(*, "names")= chr [1:506] "1" "2" "3" "4" ...
##  $ centers     : num [1:4, 1:14] -0.377 -0.408 0.859 -0.199 -0.333 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "1" "2" "3" "4"
##   .. ..$ : chr [1:14] "crim" "zn" "indus" "chas" ...
##  $ totss       : num 7070
##  $ withinss    : num [1:4] 1102 623 1380 341
##  $ tot.withinss: num 3446
##  $ betweenss   : num 3624
##  $ size        : int [1:4] 222 98 152 34
##  $ iter        : int 2
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
# --> km_b is a 'kmeans' object: a list of 9 components describing the 506 observations.
# More detailed info:
#?kmeans()

# subset 'cluster' from km (list) and add it as a column to our 'boston_scaled' dataframe
km_b_clust <- km_b$cluster
km_b_clust <- as.data.frame(km_b_clust)
boston_scaled_bonus <- mutate(boston_scaled_bonus, km_b_clust)

# Run linear discriminant analysis with clusters as target classes and all other variables of the data set as explanatory variables
lda.fit_bonus <- lda(km_b_clust ~ ., data = boston_scaled_bonus)

# print the lda.fit object
lda.fit_bonus
## Call:
## lda(km_b_clust ~ ., data = boston_scaled_bonus)
## 
## Prior probabilities of groups:
##          1          2          3          4 
## 0.43873518 0.19367589 0.30039526 0.06719368 
## 
## Group means:
##         crim         zn      indus       chas        nox          rm
## 1 -0.3773550 -0.3325348 -0.3302399 -0.2723291 -0.3380896 -0.07661724
## 2 -0.4083186  1.5993013 -1.0868786 -0.2321546 -1.0252997  0.85762197
## 3  0.8588075 -0.4872402  1.1204442 -0.2723291  1.0691487 -0.50270159
## 4 -0.1985497 -0.2602436  0.2799956  3.6647712  0.3830784  0.27566811
##           age         dis          rad        tax     ptratio      black
## 1 -0.06014166  0.07356985 -0.588184980 -0.6041027 -0.06143853  0.2840492
## 2 -1.22890399  1.28526187 -0.598658222 -0.6419128 -0.74348995  0.3532213
## 3  0.79691805 -0.84588600  1.244794751  1.3179961  0.65687472 -0.6809675
## 4  0.37213224 -0.40333817  0.001081444 -0.0975633 -0.39245849  0.1715427
##        lstat        medv
## 1 -0.2086903  0.03252933
## 2 -0.9364928  0.91875080
## 3  0.9453522 -0.76810974
## 4 -0.1643525  0.57334091
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## crim    -0.034836228  0.01232052  0.13192586
## zn      -0.267361320  0.15788045  1.20991564
## indus    0.020533059 -0.58778276  0.21069583
## chas     5.828653032  0.14949639  0.09257929
## nox      0.026703754 -0.34585368  0.54345497
## rm      -0.068843435  0.07156541  0.37000239
## age      0.098415418 -0.04458773 -0.53934235
## dis      0.053100700  0.23565203  0.47758361
## rad      0.064174139 -0.72837621  0.11551231
## tax      0.033261463 -0.60257854  0.70761129
## ptratio -0.007773957 -0.10230899  0.07726420
## black    0.015155025  0.03258028 -0.02842152
## lstat   -0.133125916 -0.26354444  0.52415255
## medv    -0.146715844  0.14530719  0.54132868
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.8121 0.1472 0.0407
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes / clusters as numeric
classes_b <- as.numeric(boston_scaled_bonus$km_b_clust)

# plot the lda results
plot(lda.fit_bonus, col = classes_b, pch = classes_b)

plot(lda.fit_bonus, dimen = 2, col = classes_b, pch = classes_b)
lda.arrows(lda.fit_bonus, myscale = 1)

plot(lda.fit_bonus, dimen = 2, col = classes_b, pch = classes_b)
lda.arrows(lda.fit_bonus, myscale = 9)

When clustering the Boston data to 4 clusters with the k-means algorithm the most influential linear separators seem to be:

  1. whether the tract bounds the Charles River (chas)
  2. the index of accessibility to radial highways (rad)
  3. full-value property-tax rate (tax)

The results are somewhat different from our earlier results with the 2-cluster algorithm. Looking at the full matrix of biplots (LD1, LD2, LD3), we might suspect either two or five distinct clusters. From our earlier analyses, though, we know that two clusters seem to be the best clustering option.

OBS! The results changed after running the R code again. To answer the assignment in general terms, the three most influential variables are the ones with the longest arrows on the LD1 vs. LD2 plot. For easier assessment of the arrows, the second LD1 vs. LD2 plot has scaled arrows (see the myscale argument).

Super-Bonus: 3D-plotting of classifications and clusters

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

#Next, install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below.
# install.packages('plotly')
library('plotly')
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')

We drew a very cool dynamic 3D plot following the instructions of our course assignment! Let’s customize it:

# add colors
# set the color to be the crime classes of the train set
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = classes) %>% layout(title = 'Plot A')

To set the colors to be the clusters of the kmeans, we’ll join data with left_join() of dplyr:

# set the color to be the clusters of the kmeans (2 clusters)
# subset 'cluster' from km (list) and add it as a column to our 'boston_scaled' dataframe
km_sb_clust <- km$cluster
km_sb_clust <- as.data.frame(km_sb_clust)
boston_scaled_sb <- mutate(boston_scaled, km_sb_clust)
# merge data
train_sb = train %>% left_join(boston_scaled_sb)
# check data
head(train_sb);dim(train_sb)
##            zn      indus       chas        nox         rm        age        dis
## 1  1.87100232 -1.0723615 -0.2723291 -0.6100835  0.8388147 -1.4378877  1.2681505
## 2  0.04872402 -0.4761823 -0.2723291 -0.2648919  0.1314594  0.9138948  1.2117800
## 3 -0.48724019  1.0149946 -0.2723291  1.5991427 -0.4976172  0.6865322 -0.9376612
## 4  0.37030254 -1.0446662 -0.2723291  0.7965722  0.7932707  1.1163897 -0.8473829
## 5 -0.48724019 -0.6161168 -0.2723291 -0.9207559 -0.3624084  0.6012712  0.8996287
## 6  0.97058245 -0.7356442 -0.2723291 -1.0502028  0.2994029 -1.7824842  0.8057412
##          rad        tax    ptratio      black      lstat       medv    crime
## 1 -0.5224844 -0.2268768 -0.3951756  0.3428010 -1.1263120  0.9423829      low
## 2 -0.5224844 -0.5769480 -1.5037485  0.3926395  1.0918456 -0.8190411  med_low
## 3  1.6596029  1.5294129  0.8057784 -3.1515905  2.9921233 -1.5366583     high
## 4 -0.5224844 -0.8558183 -2.5199404  0.3861769 -0.8056314  0.8227800 med_high
## 5 -0.7521778 -1.0397541 -0.2566040  0.3950493  0.8607875 -0.6450733  med_low
## 6 -0.2927910 -0.4701466 -1.0880337  0.2950436 -0.5577691  0.4204795  med_low
##         crim km_sb_clust
## 1 -0.4165570           2
## 2 -0.3939564           2
## 3  1.1700894           1
## 4 -0.3437607           2
## 5 -0.3934472           2
## 6 -0.4093292           2
## [1] 404  16
# target classes / clusters as numeric
classes_sb <- as.numeric(train_sb$km_sb_clust)

# Draw the 3D plot
model_predictors_sb <- dplyr::select(train_sb, -crime, -crim, -km_sb_clust)
# check the dimensions
dim(model_predictors_sb)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product_sb <- as.matrix(model_predictors_sb) %*% lda.fit$scaling
matrix_product_sb <- as.data.frame(matrix_product_sb)

#Next, install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below.
# install.packages('plotly')
library('plotly')
plot_ly(x = matrix_product_sb$LD1, y = matrix_product_sb$LD2, z = matrix_product_sb$LD3, type= 'scatter3d', mode='markers', color = classes_sb)  %>% layout(title = 'Plot B')

Plot A shows four colors depicting our four crime rate quartiles: ‘low = 1’, ‘medium low = 2’, ‘medium high = 3’, ‘high = 4’. Plot B shows coloring by our optimal two clusters. Comparing plots A and B, we notice that the dark blue cluster of Plot B seems to include some observations of the ‘medium low’ crime rate class, while the yellow cluster of Plot B seems to include observations from the low, medium low and medium high classes. Some medium high crime rate observations seem to have been clustered into the dark blue cluster of Plot B. In summary, our clustering seems to separate the high crime rate suburbs quite nicely.